SoundCloud Analysis

Author

Hajara Muzammal

Introduction:

Music discovery increasingly occurs through playlists, where listeners curate and share collections of songs across platforms such as SoundCloud. Understanding which musical characteristics are associated with popularity and playlist inclusion can provide insight into listener preferences and emerging trends. This analysis focuses on tracks that include SoundCloud links, allowing us to study how audio features, sentiment, and popularity metrics relate to playlist usage.

Using a dataset of approximately 15,000 tracks with audio features, playlist metadata, and SoundCloud URLs, this report explores the relationship between popularity and musical characteristics such as danceability, energy, tempo, and key. The goal is not to predict popularity with a formal statistical model, but rather to visually and descriptively identify patterns that distinguish popular songs from less popular ones.

This work contributes to the broader question of how music spreads across user-generated platforms and provides a data-driven perspective on what makes songs more likely to appear in playlists.

Overarching Question:

What factors influence the popularity of songs across major music streaming platforms?

My Question:

Is track popularity associated with playlist inclusion?

Data Ingest

We use a publicly available dataset hosted on Hugging Face, which contains playlist metadata, song characteristics, and direct SoundCloud links.

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)

url <- "https://huggingface.co/datasets/Zuru7/Spotify_Songs_with_SoundCloud_links/resolve/main/song_df_normalised.csv"
SONGS_raw <- read_csv(url, show_col_types = FALSE)

# Standardize names (works even if you re-run the doc)
SONGS <- SONGS_raw %>%
  rename(
    track          = any_of(c("track", "track_name")),
    artist         = any_of(c("artist", "track_artist")),
    album          = any_of(c("album", "track_album_name")),
    popularity     = any_of(c("popularity", "track_popularity")),
    playlist_genre = any_of(c("genre", "playlist_genre")),
    playlist_subgenre = any_of(c("subgenre", "playlist_subgenre")),
    soundcloud_link = any_of(c("soundcloud_link", "links"))
  ) %>%
  filter(!is.na(track), !is.na(artist), !is.na(popularity))
glimpse(SONGS)
Rows: 14,987
Columns: 23
$ track             <chr> "i feel alive", "poison", "baby it's cold outside (f…
$ artist            <chr> "steady rollin", "bell biv devoe", "ceelo green", "k…
$ lyrics            <chr> "the trees, are singing in the wind the sky blue, on…
$ album             <chr> "love & loss", "gold", "ceelo's magic moment", "kard…
$ popularity        <dbl> 28, 0, 41, 65, 70, 52, 36, 42, 1, 58, 69, 72, 74, 41…
$ playlist_name     <chr> "hard rock workout", "back in the day - r&b, new jac…
$ playlist_genre    <chr> "rock", "r&b", "r&b", "pop", "r&b", "r&b", "r&b", "e…
$ playlist_subgenre <chr> "hard rock", "new jack swing", "neo soul", "dance po…
$ danceability      <dbl> 0.2166860, 0.8447277, 0.3580533, 0.7462341, 0.440324…
$ energy            <dbl> 0.8779620, 0.6460897, 0.3674362, 0.8850809, 0.632868…
$ key               <dbl> 0.81818182, 0.54545455, 0.45454545, 0.81818182, 0.54…
$ loudness          <dbl> 0.7817377, 0.6813893, 0.7425419, 0.8813965, 0.730275…
$ mode              <dbl> 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1…
$ speechiness       <dbl> 0.02434122, 0.21616793, 0.01306387, 0.02065654, 0.03…
$ acousticness      <dbl> 0.011792960, 0.004353434, 0.694556021, 0.037297028, …
$ instrumentalness  <dbl> 0.010205339, 0.007422998, 0.000000000, 0.000000000, …
$ liveness          <dbl> 0.34221195, 0.48613476, 0.05781237, 0.13038190, 0.08…
$ valence           <dbl> 0.4080748, 0.6565622, 0.4090849, 0.2424166, 0.308073…
$ tempo             <dbl> 0.5545093, 0.4227024, 0.4605076, 0.5250801, 0.625378…
$ language          <chr> "en", "en", "en", "en", "en", "en", "en", "en", "es"…
$ sentiment         <chr> "Positive", "Positive", "Positive", "Negative", "Pos…
$ song_artist       <chr> "i feel alive steady rollin", "poison bell biv devoe…
$ soundcloud_link   <chr> "http://soundcloud.com/xobak3r/purple-vision-ft-xoro…

The dataset contains 14,987 observations and 23 variables, including track-level metadata, playlist context, sentiment labels, and normalized audio features such as danceability, energy, and valence. Each observation represents a song associated with at least one playlist and includes a direct SoundCloud link.
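The glimpse output offers a hint about how the features were normalized: the first key values (0.81818182, 0.54545455, 0.45454545) equal 9/11, 6/11, and 5/11, which is consistent with min-max scaling of a key coded 0 to 11. This is an inference from the printed values, not something the dataset documentation is known to confirm. A minimal sketch of that scaling, where `min_max` and `raw_key` are hypothetical names introduced for illustration:

```r
# Min-max scaling to [0, 1]. Assumed here; the dataset's actual
# normalization method is not stated in this report.
min_max <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical raw key codes, assuming the usual 0-11 pitch-class coding
raw_key <- 0:11
scaled_key <- min_max(raw_key)

# A raw key of 9 maps to 9/11, matching the 0.81818182 seen in the output
scaled_key[raw_key == 9]
```

If this assumption holds, the normalized features are comparable in range but no longer carry their original units, so a tempo of 0.55 cannot be read directly as beats per minute.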

Data Cleaning

PLAYLIST_TABLE <- SONGS %>%
  transmute(
    playlist_name   = playlist_name,
    artist_name     = artist,
    track_name      = track,
    album_name      = album,
    popularity      = popularity,
    playlist_genre  = playlist_genre,
    playlist_subgenre = playlist_subgenre,
    soundcloud_link = soundcloud_link
  )

glimpse(PLAYLIST_TABLE)
Rows: 14,987
Columns: 8
$ playlist_name     <chr> "hard rock workout", "back in the day - r&b, new jac…
$ artist_name       <chr> "steady rollin", "bell biv devoe", "ceelo green", "k…
$ track_name        <chr> "i feel alive", "poison", "baby it's cold outside (f…
$ album_name        <chr> "love & loss", "gold", "ceelo's magic moment", "kard…
$ popularity        <dbl> 28, 0, 41, 65, 70, 52, 36, 42, 1, 58, 69, 72, 74, 41…
$ playlist_genre    <chr> "rock", "r&b", "r&b", "pop", "r&b", "r&b", "r&b", "e…
$ playlist_subgenre <chr> "hard rock", "new jack swing", "neo soul", "dance po…
$ soundcloud_link   <chr> "http://soundcloud.com/xobak3r/purple-vision-ft-xoro…

We cleaned the dataset by first removing any tracks with missing track, artist, or popularity values (the filter in the ingest step), so that all subsequent analyses are based on complete and comparable observations. PLAYLIST_TABLE then keeps only the playlist-level columns used in the analysis. We then define a threshold for popularity using the 75th percentile of the popularity distribution, which provides a data-driven way to distinguish more popular songs from the rest of the catalog. Finally, a binary indicator variable (is_popular) labels each track as popular or not, allowing clearer comparisons between popular and less popular songs in later visualizations and analyses.
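The percentile threshold and the is_popular indicator described above do not appear in the code shown in this report. A minimal sketch of how they could be computed, using a small made-up data frame in place of SONGS; `toy_songs`, `pop_q75`, and `toy_flagged` are hypothetical names:

```r
library(dplyr)

# Toy stand-in for SONGS (made-up values, not real data)
toy_songs <- data.frame(
  track      = c("a", "b", "c", "d", "e"),
  popularity = c(10, 40, 55, 70, 90)
)

# 75th percentile of the popularity distribution
pop_q75 <- quantile(toy_songs$popularity, probs = 0.75, names = FALSE)

# Flag tracks at or above the threshold
toy_flagged <- toy_songs %>%
  mutate(is_popular = popularity >= pop_q75)

toy_flagged
```

On the real data the same two steps would apply to SONGS$popularity. Whether the resulting percentile matches a fixed cutoff depends on the actual distribution.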

Data Exploration

We define a “popular song” as one with a popularity score greater than or equal to 70.

ppop_threshold <- 70
ppop_threshold
[1] 70
track_counts <- PLAYLIST_TABLE %>%
  distinct(playlist_name, track_name, artist_name, popularity) %>%
  count(track_name, artist_name, popularity, name = "playlist_appearances")

glimpse(track_counts)
Rows: 14,987
Columns: 4
$ track_name           <chr> "$20 fine", "$ave dat money (feat. fetty wap & ri…
$ artist_name          <chr> "jimi hendrix", "lil dicky", "max frost", "queen"…
$ popularity           <dbl> 44, 69, 43, 60, 0, 39, 83, 75, 50, 48, 55, 68, 5,…
$ playlist_appearances <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…

Popularity vs Playlist Appearances

ggplot(track_counts, aes(x = popularity, y = playlist_appearances)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Popularity vs Playlist Appearances",
    x = "Track Popularity",
    y = "Number of Playlist Appearances"
  ) +
  theme_minimal(base_size = 13)

This plot shows no clear relationship between track popularity and the number of playlist appearances. track_counts has the same 14,987 rows as PLAYLIST_TABLE, so each track appears in exactly one playlist and playlist_appearances equals 1 at every popularity score. In this dataset, playlist inclusion therefore does not increase with popularity.
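Before reading much into the scatter plot, it can help to tabulate how many tracks fall at each value of playlist_appearances. The sketch below does this on a small made-up table; `toy_playlists`, `toy_counts`, and `appearance_table` are hypothetical names, and the rows are not real data:

```r
library(dplyr)

# Toy stand-in for PLAYLIST_TABLE (made-up rows)
toy_playlists <- data.frame(
  playlist_name = c("p1", "p2", "p1", "p3"),
  track_name    = c("song a", "song a", "song b", "song c"),
  artist_name   = c("x", "x", "y", "z")
)

# One row per track, counting the distinct playlists it appears in
toy_counts <- toy_playlists %>%
  distinct(playlist_name, track_name, artist_name) %>%
  count(track_name, artist_name, name = "playlist_appearances")

# How many tracks appear in 1, 2, ... playlists
appearance_table <- toy_counts %>%
  count(playlist_appearances, name = "n_tracks")

appearance_table
```

Unlike the report's track_counts, this sketch does not group by popularity. Grouping by popularity as well would split a track into separate rows if its popularity score differed between playlist entries, which would understate its appearance count.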

Most danceable songs

SONGS %>%
  arrange(desc(danceability)) %>%
  select(track, artist, danceability, popularity, soundcloud_link) %>%
  slice_head(n = 5)
# A tibble: 5 × 5
  track                           artist danceability popularity soundcloud_link
  <chr>                           <chr>         <dbl>      <dbl> <chr>          
1 ice ice baby                    vanil…        1             70 http://soundcl…
2 cha cha slide - original live … dj ca…        0.999         54 http://soundcl…
3 funky friday                    dave          0.995         72 http://soundcl…
4 bad bad bad (feat. lil baby)    young…        0.994         81 http://soundcl…
5 cinnamon girl - radio edit      [dunk…        0.994         47 http://soundcl…

The most danceable track is "ice ice baby", with a normalized danceability of 1 and a popularity score of 70.

Danceability vs Popularity

ggplot(SONGS, aes(x = danceability, y = popularity)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "Danceability vs Popularity",
    x = "Danceability",
    y = "Popularity"
  ) +
  theme_minimal(base_size = 13)

This plot shows a weak but positive relationship between danceability and track popularity, indicating that more danceable songs tend to be slightly more popular on average. However, the wide dispersion of points suggests that danceability alone is not a strong predictor of popularity, and highly popular songs exist across a broad range of danceability values.
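A single descriptive number can complement the visual impression of a "weak but positive" relationship. The sketch below computes a Spearman rank correlation with base R's cor() on a small made-up data frame; `toy` and `rho` are hypothetical names, and the value obtained here says nothing about the real data:

```r
# Toy stand-in for SONGS (made-up values, not real data)
toy <- data.frame(
  danceability = c(0.2, 0.4, 0.6, 0.8, 1.0),
  popularity   = c(30, 20, 50, 40, 60)
)

# Spearman correlation uses ranks, so it does not assume a linear trend
rho <- cor(toy$danceability, toy$popularity, method = "spearman")
rho
```

On the real data the equivalent call would be cor(SONGS$danceability, SONGS$popularity, method = "spearman"). This stays descriptive, in keeping with the report's choice not to fit a formal statistical model.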

Tempo vs Popularity

ggplot(SONGS, aes(x = tempo, y = popularity)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "loess", se = FALSE) +
  labs(
    title = "Tempo vs Popularity",
    x = "Tempo",
    y = "Popularity"
  ) +
  theme_minimal(base_size = 13)

The relationship between tempo and popularity appears weak and non-linear, with popularity remaining relatively stable across most tempo values. This suggests that tempo alone does not strongly influence a song’s popularity on playlists.

Conclusion:

Overall, the analysis suggests that while playlist exposure and audio features are weakly related to popularity, no single characteristic explains why a song becomes popular on SoundCloud-linked Spotify playlists. Playlist appearances show at most a weak relationship with track popularity, since tracks in this dataset typically appear in only one playlist regardless of their popularity score. Danceability has a mild positive association with popularity, implying that more danceable songs tend to perform slightly better, while popularity stays roughly stable across tempo values. Popular songs are also concentrated at moderate-to-high energy levels and generally exhibit balanced valence, suggesting that listeners gravitate toward songs that are energetic but emotionally neutral to positive rather than extremely sad or euphoric. Taken together, these findings indicate that popularity is multifaceted: audio features contribute to success, but playlist dynamics, listener behavior, and external factors likely also play an important role in shaping which songs gain widespread attention.